ISSS608-VAA
  • Hands-on Exercise
    • Hands-on Exercise 1
    • Hands-on Exercise 2
    • Hands-on Exercise 3
    • Hands-on Exercise 4a
    • Hands-on Exercise 4b
    • Hands-on Exercise 4c
    • Hands-on Exercise 5a
  • In-Class Exercise
    • In-Class Exercise 3
    • In-Class Exercise 4
    • In-Class Exercise 5
  • Take-Home Exercise
    • Take-Home Exercise 1
    • Take-Home Exercise 2
    • Take-Home Exercise 3
  • Home

On this page

  • 1. Task
  • 2. Dataset
    • 2.1 Data Source
    • 2.2 Data Preparation
  • 3. Data Exploration & Visualizations
    • 3.1 Resale Flats by Flat Models
    • 3.2 Resale Flats by Flat Types
    • 3.3 Resale Prices vs Flat Models
    • 3.4 Resale Prices by Flat Types
    • 3.5 Resale Prices vs Floor Area
    • 3.6 Resale Prices vs Floor Area by Month

Take-Home Exercise 3

Author

Bhairavi

Published

February 15, 2023

The following is my submission for Take-Home Exercise 3.

1. Task

In this take-home exercise, we are required to uncover the salient patterns of the resale prices of public housing property by residential towns and estates in Singapore by using appropriate analytical visualization techniques learned in Lesson 4: Fundamentals of Visual Analytics. We are encouraged to apply appropriate interactive techniques to enhance user and data discovery experiences.

For the purpose of this study, the focus would be on 3-ROOM, 4-ROOM and 5-ROOM types. We can choose to focus on either one housing type or multiple housing types. In my case, I will be focusing on multiple housing types for the year 2022.

2. Dataset

2.1 Data Source

The dataset used in this take-home exercise was downloaded from the following link: https://data.gov.sg/dataset/resale-flat-prices

We are using the Resale flat prices based on registration date from Jan-2017 onwards csv file for this exercise.

2.2 Data Preparation

First, let’s load the dataset and the libraries that we will be using:

pacman::p_load(tidyverse, lubridate, plotly, treemap, ggstatsplot)
rawData <- read_csv("resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

Let’s take a look at this dataset:

head(rawData)
# A tibble: 6 × 11
  month   town     flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
  <chr>   <chr>    <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl> <chr>  
1 2017-01 ANG MO … 2 ROOM  406   ANG MO… 10 TO …      44 Improv…    1979 61 yea…
2 2017-01 ANG MO … 3 ROOM  108   ANG MO… 01 TO …      67 New Ge…    1978 60 yea…
3 2017-01 ANG MO … 3 ROOM  602   ANG MO… 01 TO …      67 New Ge…    1980 62 yea…
4 2017-01 ANG MO … 3 ROOM  465   ANG MO… 04 TO …      68 New Ge…    1980 62 yea…
5 2017-01 ANG MO … 3 ROOM  601   ANG MO… 01 TO …      67 New Ge…    1980 62 yea…
6 2017-01 ANG MO … 3 ROOM  150   ANG MO… 01 TO …      68 New Ge…    1981 63 yea…
# … with 1 more variable: resale_price <dbl>, and abbreviated variable names
#   ¹​flat_type, ²​street_name, ³​storey_range, ⁴​floor_area_sqm, ⁵​flat_model,
#   ⁶​lease_commence_date, ⁷​remaining_lease
summary(rawData)
    month               town            flat_type            block          
 Length:146872      Length:146872      Length:146872      Length:146872     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 street_name        storey_range       floor_area_sqm   flat_model       
 Length:146872      Length:146872      Min.   : 31.0   Length:146872     
 Class :character   Class :character   1st Qu.: 82.0   Class :character  
 Mode  :character   Mode  :character   Median : 94.0   Mode  :character  
                                       Mean   : 97.6                     
                                       3rd Qu.:113.0                     
                                       Max.   :249.0                     
 lease_commence_date remaining_lease     resale_price    
 Min.   :1966        Length:146872      Min.   : 140000  
 1st Qu.:1985        Class :character   1st Qu.: 358000  
 Median :1996        Mode  :character   Median : 448000  
 Mean   :1996                           Mean   : 478316  
 3rd Qu.:2007                           3rd Qu.: 565800  
 Max.   :2019                           Max.   :1418000  

As you can see, the month column is in the format of YYYY-MM. Using the month column, we can actually create two new columns - Date (in YYYY-MM-DD format) and Year. These fields will be useful when building visualizations.

data <- rawData %>%
  mutate(Year = year(ym(month))) %>%
  mutate(Date = as.Date(paste(month, "-01", sep = ""),
                        format = "%Y - %m - %d"))
summary(data)
    month               town            flat_type            block          
 Length:146872      Length:146872      Length:146872      Length:146872     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 street_name        storey_range       floor_area_sqm   flat_model       
 Length:146872      Length:146872      Min.   : 31.0   Length:146872     
 Class :character   Class :character   1st Qu.: 82.0   Class :character  
 Mode  :character   Mode  :character   Median : 94.0   Mode  :character  
                                       Mean   : 97.6                     
                                       3rd Qu.:113.0                     
                                       Max.   :249.0                     
 lease_commence_date remaining_lease     resale_price          Year     
 Min.   :1966        Length:146872      Min.   : 140000   Min.   :2017  
 1st Qu.:1985        Class :character   1st Qu.: 358000   1st Qu.:2018  
 Median :1996        Mode  :character   Median : 448000   Median :2020  
 Mean   :1996                           Mean   : 478316   Mean   :2020  
 3rd Qu.:2007                           3rd Qu.: 565800   3rd Qu.:2021  
 Max.   :2019                           Max.   :1418000   Max.   :2023  
      Date           
 Min.   :2017-01-01  
 1st Qu.:2018-09-01  
 Median :2020-07-01  
 Mean   :2020-03-27  
 3rd Qu.:2021-10-01  
 Max.   :2023-02-01  

Since we will only be focusing on the year 2022, we will filter the dataset to only keep records for the year 2022. Also, we only want to work with records which have a flat type of 3-Room, 4-Room or 5-Room. Let’s filter the data to meet this condition.

resaleFlats <- data %>%
  filter(Year == "2022") %>%
  filter(flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM"))
head(resaleFlats)
# A tibble: 6 × 13
  month   town     flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
  <chr>   <chr>    <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl> <chr>  
1 2022-01 ANG MO … 3 ROOM  320   ANG MO… 07 TO …      73 New Ge…    1977 54 yea…
2 2022-01 ANG MO … 3 ROOM  225   ANG MO… 07 TO …      67 New Ge…    1978 55 yea…
3 2022-01 ANG MO … 3 ROOM  331   ANG MO… 07 TO …      68 New Ge…    1981 58 yea…
4 2022-01 ANG MO … 3 ROOM  534   ANG MO… 07 TO …      82 New Ge…    1980 57 yea…
5 2022-01 ANG MO … 3 ROOM  578   ANG MO… 04 TO …      67 New Ge…    1980 57 yea…
6 2022-01 ANG MO … 3 ROOM  452   ANG MO… 01 TO …      83 New Ge…    1979 56 yea…
# … with 3 more variables: resale_price <dbl>, Year <dbl>, Date <date>, and
#   abbreviated variable names ¹​flat_type, ²​street_name, ³​storey_range,
#   ⁴​floor_area_sqm, ⁵​flat_model, ⁶​lease_commence_date, ⁷​remaining_lease

The data is now ready for visualization.

3. Data Exploration & Visualizations

Let’s first explore the fields we have in our dataset, such as the unique values we have in our key categorical variables - Town, Flat Types and Flat Models.

unique(resaleFlats$town)
 [1] "ANG MO KIO"      "BEDOK"           "BISHAN"          "BUKIT BATOK"    
 [5] "BUKIT MERAH"     "BUKIT PANJANG"   "BUKIT TIMAH"     "CENTRAL AREA"   
 [9] "CHOA CHU KANG"   "CLEMENTI"        "GEYLANG"         "HOUGANG"        
[13] "JURONG EAST"     "JURONG WEST"     "KALLANG/WHAMPOA" "MARINE PARADE"  
[17] "PASIR RIS"       "PUNGGOL"         "QUEENSTOWN"      "SEMBAWANG"      
[21] "SENGKANG"        "SERANGOON"       "TAMPINES"        "TOA PAYOH"      
[25] "WOODLANDS"       "YISHUN"         
unique(resaleFlats$flat_type)
[1] "3 ROOM" "4 ROOM" "5 ROOM"
unique(resaleFlats$flat_model)
 [1] "New Generation"         "Improved"               "Model A"               
 [4] "DBSS"                   "Simplified"             "Premium Apartment"     
 [7] "Standard"               "Model A-Maisonette"     "Model A2"              
[10] "Type S1"                "Type S2"                "Premium Apartment Loft"
[13] "Adjoined flat"          "Terrace"                "3Gen"                  
[16] "Improved-Maisonette"   

For this exercise, we will focus on the following areas:

  1. Resale Flats by Flat Models
  2. Resale Flats by Flat Types
  3. Resale Prices by Flat Models
  4. Resale Prices by Flat Types
  5. Resale Prices vs Floor Area
  6. Resale Prices vs Floor Area by Month

3.1 Resale Flats by Flat Models

The code chunk below aims to explore the number of resale flats by flat models.

Code
fig3 <- ggplot(data = resaleFlats,
               aes(x = flat_type,
                   fill = flat_type,
                   colour = flat_type)) +
  geom_bar(width = 0.5) +
  facet_wrap(~ flat_model) +
  labs(title = "Trend of Flat Models by Flat Types in Singapore Towns, 2022") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(fig3)

Based on the figure above, it is clear that certain flat models are more common that the others, and these common flat models are either Improved, Model A or New Generation. Another reason as to why the other flat models do not seem to have much data could be because we already filtered the dataset to only focus on 3 Room, 4 Room and 5 Room flats.

3.2 Resale Flats by Flat Types

The following code chunk aims to explore the number of resale flats by flat type for the various towns in Singapore.

Code
fig1 <- ggplot(data = resaleFlats,
               aes(x = flat_type,
                   fill = flat_type,
                   colour = flat_type)) +
  geom_bar(width = 0.5) +
  facet_wrap(~ town) +
  labs(title = "Trend of Flat Types in Singapore Towns, 2022") + 
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(fig1)

Based on the interactive plot above, we can deduce that Jurong East, Jurong West and Serangoon do not seem to have many resale flats in their towns. On the other hand, Punggol, Sengkang, Woodlands and Yishun seems to have a higher number of resale flats in their towns.

Let’s narrow down our data exploration by grouping the towns based on their average resale prices. The code chunk below groups the dataset by date, towns and their flat types and computes the average resale price for this grouping.

Code
groupedFlats <- resaleFlats %>%
  group_by(Date, town, flat_type) %>%
  summarise(avgPrice = mean(resale_price)) %>%
  ungroup()

groupedTowns <- resaleFlats %>%
  group_by(Date, town, flat_type) %>%
  summarise(avgPrice = mean(resale_price))

Another way to visualize the number of resale flats by flat type would be to plot a treemap. This treemap will allow us to gain quick insights into which town has a higher contribution to each flat type.

Code
level1 <- groupedTowns$flat_type
level2 <- groupedTowns$town
avgPrice <- groupedTowns$avgPrice
data <- data.frame(level1, level2, avgPrice)

treemap(data,
        index = c("level1", "level2"),
        vSize = c("avgPrice"),
        type = "index",
        palette = "Set2")

As seen in the treemap above, Central Area seems to have one of the highest average prices of resale flats across the different flat types in Singapore. Other towns which have a high average price of resale flats across the different flat types include Queenstown, Bishan and Bukit Merah.

3.3 Resale Prices vs Flat Models

One way to compare the resale prices by flat models would be to plot a violin plot to view the distribution as well as the mean price of the resale flats. The code chunk below intends to do that. We are focusing on the flat models of Improved, Model A and New Generation as these are the models which had a higher number of resale flats as per our exploration in previous steps.

Code
flatModels <- resaleFlats %>%
  filter(flat_model %in% c("Improved", "Model A", "New Generation"))

fig6 <- ggbetweenstats(
  data = flatModels,
  x = "flat_model",
  y = "resale_price",
  type = "np") +
  ggtitle("Non-Parametric Analysis of Resale Prices by Flat Model")

ggplotly(fig6)

From the above violin plot, we can observe that the mean of resale prices for Improved flats slightly higher that mean of resale price of Model A. On the other hand, the mean of resale price of New Generation flats are significantly lower.

3.4 Resale Prices by Flat Types

The following code chunk plots the average resale prices by flat types across 2022 for each town:

Code
fig2 <- ggplot(groupedTowns,
       aes(x = Date,
           y = avgPrice)) +
  geom_line(aes(colour = flat_type)) +
  facet_wrap(~ town) +
  labs(title = "Trend of Resale Prices by Flat Types in Singapore Towns, 2022") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(fig2)

As seen in the interactive plot above, the average resale prices in towns such as Central Area, Bishan, Clementi and Toa Payoh seem to be very high as when compared to towns such as Choa Chu Kang, Jurong West and Sembawang.

Another way to visualize the resale prices by flat types would be plot a violin plot.

Code
fig7 <- ggbetweenstats(
  data = resaleFlats,
  x = "flat_type",
  y = "resale_price",
  type = "np") +
  ggtitle("Non-Parametric Analysis of Resale Prices by Flat Type")

ggplotly(fig7)

3.5 Resale Prices vs Floor Area

Let’s use the code chunk below to find out if there is a correlation between the resale price and the floor area:

Code
fig4 <- ggscatterstats(
  data = resaleFlats,
  x = floor_area_sqm,
  y = resale_price,
  marginal = FALSE,
  point.args = list(size = 0.5,
                    stroke = 0,
                    colour = "grey"),
  smooth.line.args = list(linewidth = 0.5,
                          colour = "black",
                          method = "lm")) +
  labs(title = "Correlation between Resale Prices and Floor Area, 2022") +
  facet_wrap(~ town)

ggplotly(fig4)

This scatterplot helps to depict the relationship between the two numerical variables resale_price and floor_area_sqm. So how do we read this plot? The angles of the lines depict the relationship between the two variables, and this could be either a positive or negative relationship. At once glance, we can tell that there is a positive correlation between these two variables. However, when looking at towns such as Central Area and Toa Payoh, we can note that the relationship is not as linear as the relationship in other towns.

3.6 Resale Prices vs Floor Area by Month

Let’s use the code chunk below to find out the trend of the resale prices by floor area across the months. We will narrow down our analysis to only focus on 5 Room flats for this visualization.

Code
improvedModels <- resaleFlats %>%
  filter(flat_model == "Improved") %>%
  filter(flat_type == "5 ROOM")

heatmap <- ggplot(improvedModels,
                  aes(x = Date,
                      y = floor_area_sqm,
                      fill = resale_price)) +
  geom_tile() +
  labs(title = "5 Room Flat's Resale Prices by Floor Area in 2022")

ggplotly(heatmap)

From this heatmap, we will be unable to predict the resale price of a 5 room flat for the next month as you can see that there is an inconsistency in the resale prices when compared against the floor area. This could be something that further explore in the future as there could be other factors that influence the resale prices.

This concludes the work submitted for Take-Home Exercise 3. Thank you!